The Annals of Applied Statistics 
2010, Vol. 4, No. 1, 124-150 
DOI: 10.1214/09-AOAS309 

© Institute of Mathematical Statistics. 2010 



HIERARCHICAL RELATIONAL MODELS 
FOR DOCUMENT NETWORKS 

By Jonathan Chang 1 and David M. Blei 2 

Facebook and Princeton University 

We develop the relational topic model (RTM), a hierarchical 
model of both network structure and node attributes. We focus on 
document networks, where the attributes of each document are its 
words, that is, discrete observations taken from a fixed vocabulary. 
For each pair of documents, the RTM models their link as a binary 
random variable that is conditioned on their contents. The model can 
be used to summarize a network of documents, predict links between 
them, and predict words within them. We derive efficient inference 
and estimation algorithms based on variational methods that take 
advantage of sparsity and scale with the number of links. We eval- 
uate the predictive performance of the RTM for large networks of 
scientific abstracts, web documents, and geographically tagged news. 

1. Introduction. Network data, such as citation networks of documents, 
hyperlinked networks of web pages, and social networks of friends, are per- 
vasive in applied statistics and machine learning. The statistical analysis 
of network data can provide both useful predictive models and descriptive 
statistics. Predictive models can point social network members toward new 
friends, scientific papers toward relevant citations, and web pages toward 
other related pages. Descriptive statistics can uncover the hidden commu- 
nity structure underlying a network data set. 

Recent research in this field has focused on latent variable models of link 
structure, models that decompose a network according to hidden patterns 
of connections between its nodes [Kemp, Griffiths and Tenenbaum (2004); 
Hofman and Wiggins (2007); Airoldi et al. (2008)]. These models represent 
a significant departure from statistical models of networks, which explain 
network data in terms of observed sufficient statistics [Fienberg, Meyer and
Wasserman (1985); Wasserman and Pattison (1996); Getoor et al. (2001); 
Newman (2002); Taskar et al. (2004)]. 

While powerful, current latent variable models account only for the struc- 
ture of the network, ignoring additional attributes of the nodes that might be 
available. For example, a citation network of articles also contains text and 
abstracts of the documents, a linked set of web-pages also contains the text 
for those pages, and an on-line social network also contains profile descrip- 
tions and other information about its members. This type of information 
about the nodes, along with the links between them, should be used for 
uncovering, understanding, and exploiting the latent structure in the data. 

To this end, we develop a new model of network data that accounts for 
both links and attributes. While a traditional network model requires some 
observed links to provide a predictive distribution of links for a node, our 
model can predict links using only a new node's attributes. Thus, we can 
suggest citations of newly written papers, predict the likely hyperlinks of a 
web page in development, or suggest friendships in a social network based 
only on a new user's profile of interests. Moreover, given a new node and 
its links, our model provides a predictive distribution of node attributes. 
This mechanism can be used to predict keywords from citations or a user's 
interests from his or her social connections. Such prediction problems are 
out of reach for traditional network models. 

Here we focus on document networks. The attributes of each document 
are its text, that is, discrete observations taken from a fixed vocabulary, and 
the links between documents are connections such as friendships, hyperlinks, 
citations, or adjacency. To model the text, we build on previous research 
in mixed-membership document models, where each document exhibits a 
latent mixture of multinomial distributions or "topics" [Blei, Ng and Jor- 
dan (2003); Erosheva, Fienberg and Lafferty (2004); Steyvers and Griffiths 
(2007)]. The links are then modeled dependent on this latent representation. 
We call our model, which explicitly ties the content of the documents with 
the connections between them, the relational topic model (RTM). 

The RTM affords a significant improvement over previously developed 
models of document networks. Because the RTM jointly models node at- 
tributes and link structure, it can be used to make predictions about one 
given the other. Previous work tends to explore one or the other of these 
two prediction problems. Some previous work uses link structure to make at- 
tribute predictions [Chakrabarti, Dom and Indyk (1998); Kleinberg (1999)], 
including several topic models [McCallum, Corrada-Emmanuel and Wang 
(2005); Wang, Mohanty and McCallum (2005); Dietz, Bickel and Scheffer 
(2007)]. However, none of these methods can make predictions about links 
given words. 
Other models use node attributes to predict links [Hoff, Raftery and Hand-
cock (2002)]. However, these models condition on the attributes but do not
model them. While this may be effective for small numbers of attributes 
of low dimension, these models cannot make meaningful predictions about 
or using high-dimensional attributes such as text data. As our empirical 
study in Section 4 illustrates, the mixed-membership component provides 
dimensionality reduction that is essential for effective prediction. 

In addition to being able to make predictions about links given words and 
words given links, the RTM is able to do so for new documents — documents 
outside of the training data. Approaches which generate document links 
through topic models treat links as discrete "terms" from a separate vo- 
cabulary that essentially indexes the observed documents [Cohn and Hof- 
mann (2001); Erosheva, Fienberg and Lafferty (2004); Gruber, Rosen-Zvi 
and Weiss (2008); Nallapati and Cohen (2008); Sinkkonen, Aukia and Kaski 
(2008)]. Through this index, such approaches encode the observed training 
data into the model and thus cannot generalize to observations outside of 
them. Link and word predictions for new documents, of the kind we evaluate 
in Section 4.1, are ill defined. 

Xu et al. (2006, 2008) have jointly modeled links and document content 
using nonparametric Bayesian techniques so as to avoid these problems. 
However, their work does not assume mixed-memberships, which have been
shown to be useful for both document modeling [Blei, Ng and Jordan (2003)] 
and network modeling [Airoldi et al. (2008)]. Recent work from Nallapati et 
al. (2008) has also jointly modeled links and document content. We elucidate 
the subtle but important differences between their model and the RTM in 
Section 2.2. We then demonstrate in Section 4.1 that the RTM makes mod- 
eling assumptions that lead to significantly better predictive performance. 

The remainder of this paper is organized as follows. First, we describe the 
statistical assumptions behind the relational topic model. Then, we derive 
efficient algorithms based on variational methods for approximate posterior 
inference, parameter estimation, and prediction. Finally, we study the per- 
formance of the RTM on scientific citation networks, hyperlinked web pages, 
and geographically tagged news articles. The RTM provides better word pre- 
diction and link prediction than natural alternatives and the current state 
of the art. 

2. Relational topic models. The relational topic model (RTM) is a hier- 
archical probabilistic model of networks, where each node is endowed with 
attribute information. We will focus on text data, where the attributes are 
the words of the documents (see Figure 1). The RTM embeds this data in 
a latent space that explains both the words of the documents and how they 
are connected. 
Fig. 1. Example data appropriate for the relational topic model. Each document is rep-
resented as a bag of words and linked to other documents via citation. The RTM defines
a joint distribution over the words in each document and the citation links between them. 



2.1. Modeling assumptions. The RTM builds on previous work in mixed- 
membership document models. Mixed-membership models are latent vari- 
able models of heterogeneous data, where each data point can exhibit mul- 
tiple latent components. Mixed-membership models have been successfully 
applied in many domains, including survey data [Erosheva, Fienberg and 
Joutard (2007)], image data [Barnard et al. (2003); Fei-Fei and Perona 
(2005)], rank data [Gormley and Murphy (2009)], network data [Airoldi 
et al. (2008)] and document modeling [Blei, Ng and Jordan (2003); Steyvers 
and Griffiths (2007)]. Mixed-membership models were independently devel- 
oped in the field of population genetics [Pritchard, Stephens and Donnelly 
(2000)]. 

To model node attributes, the RTM reuses the statistical assumptions 
behind latent Dirichlet allocation (LDA) [Blei, Ng and Jordan (2003)], a 
mixed-membership model of documents. 3 Specifically, LDA is a hierarchical 
probabilistic model that uses a set of "topics," distributions over a fixed vo- 
cabulary, to describe a corpus of documents. In its generative process, each 
document is endowed with a Dirichlet-distributed vector of topic propor- 
tions, and each word of the document is assumed drawn by first drawing a 
topic assignment from those proportions and then drawing the word from 
the corresponding topic distribution. While a traditional mixture model of
documents assumes that every word of a document arises from a single mixture
component, LDA allows each document to exhibit multiple components
via the latent topic proportions vector.

3 A general mixed-membership model can accommodate any kind of grouped data paired
with an appropriate observation model [Erosheva, Fienberg and Lafferty (2004)].

In the RTM, each document is first generated from topics as in LDA. 
The links between documents are then modeled as binary variables, one 
for each pair of documents. These binary variables are distributed accord- 
ing to a distribution that depends on the topics used to generate each of 
the constituent documents. Because of this dependence, the content of the 
documents is statistically connected to the link structure between them. 
Thus, each document's mixed-membership depends both on the content of 
the document as well as the pattern of its links. In turn, documents whose 
memberships are similar will be more likely to be connected under the model. 

The parameters of the RTM are as follows: the topics $\beta_{1:K}$, K multinomial
parameters each describing a distribution on words; a K-dimensional
Dirichlet parameter $\alpha$; and a function $\psi$ that provides binary probabilities.
(This function is explained in detail below.) We denote a set of observed
documents by $w_{1:D,1:N}$, where $w_{i,1:N}$ are the words of the ith document. (Words
are assumed to be discrete observations from a fixed vocabulary.) We denote
the links between the documents as binary variables $y_{1:D,1:D}$, where $y_{i,j}$ is
one if there is a link between the ith and jth document. The RTM assumes
that a set of observed documents $w_{1:D,1:N}$ and binary links between them
$y_{1:D,1:D}$ are generated by the following process:

1. For each document d:

(a) Draw topic proportions $\theta_d \mid \alpha \sim \mathrm{Dir}(\alpha)$.

(b) For each word $w_{d,n}$:

i. Draw assignment $z_{d,n} \mid \theta_d \sim \mathrm{Mult}(\theta_d)$.

ii. Draw word $w_{d,n} \mid z_{d,n}, \beta_{1:K} \sim \mathrm{Mult}(\beta_{z_{d,n}})$.

2. For each pair of documents d, d':

(a) Draw binary link indicator

$y_{d,d'} \mid z_d, z_{d'} \sim \psi(\cdot \mid z_d, z_{d'}, \eta)$,

where $z_d = \{z_{d,1}, z_{d,2}, \ldots, z_{d,n}\}$.

Figure 2 illustrates the graphical model for this process for a single pair of 
documents. The full model, which is difficult to illustrate in a small graphical 
model, contains the observed words from all D documents, and $D^2$ link
variables for each possible connection between them. 
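
The generative process above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the helper names (`sample_dirichlet`, `sample_index`, `generate_rtm`) and the toy parameter values are assumptions, the link is drawn with the logistic link function of equation (2.1), and only unordered pairs of distinct documents are sampled.

```python
import math
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dir(alpha) by normalizing independent Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]


def sample_index(probs, rng):
    """Draw one index with the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_rtm(alpha, beta, eta, nu, doc_lengths, seed=0):
    """Sample words and links for a small corpus (hypothetical helper).

    beta[k] is topic k's distribution over the vocabulary. Links use the
    logistic link function; one indicator per unordered document pair.
    """
    rng = random.Random(seed)
    num_topics = len(alpha)
    docs, z_bars = [], []
    for length in doc_lengths:
        theta = sample_dirichlet(alpha, rng)                     # step 1(a)
        z = [sample_index(theta, rng) for _ in range(length)]    # step 1(b)i
        docs.append([sample_index(beta[k], rng) for k in z])     # step 1(b)ii
        # Mean topic assignment vector of this document.
        z_bars.append([z.count(k) / length for k in range(num_topics)])
    links = {}
    for d in range(len(docs)):                                   # step 2
        for e in range(d + 1, len(docs)):
            score = nu + sum(
                h * a * b for h, a, b in zip(eta, z_bars[d], z_bars[e])
            )
            prob = 1.0 / (1.0 + math.exp(-score))
            links[(d, e)] = int(rng.random() < prob)
    return docs, links


docs, links = generate_rtm(
    alpha=[0.5, 0.5],
    beta=[[0.8, 0.2, 0.0], [0.0, 0.2, 0.8]],
    eta=[3.0, 3.0],
    nu=-1.0,
    doc_lengths=[5, 7, 4],
)
```

With positive coefficients, documents whose words were drawn from the same topics receive a larger score and are therefore more likely to be linked, which is the coupling between content and links that the model is built around.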

2.2. Link probability function. The function $\psi$ is the link probability func-
tion that defines a distribution over the link between two documents. This
function is dependent on the two vectors of topic assignments that generated
their words, $z_d$ and $z_{d'}$.

This modeling decision is important. A natural alternative is to model 
links as a function of the topic proportions vectors $\theta_d$ and $\theta_{d'}$. One such
model is that of Nallapati et al. (2008), which extends the mixed-membership
stochastic blockmodel [Airoldi et al. (2008)] to generate node attributes. 
Similar in spirit is the nongenerative model of Mei et al. (2008) which "reg- 
ularizes" topic models with graph information. The issue with these formula- 
tions is that the links and words of a single document are possibly explained 
by disparate sets of topics, thereby hindering their ability to make predic- 
tions about words from links and vice versa. 

In enforcing that the link probability function depends on the latent topic
assignments $z_d$ and $z_{d'}$, we enforce that the specific topics used to generate
the links are those used to generate the words. A similar mechanism is 
employed in Blei and McAuliffe (2007) for nonpair-wise response variables. 
In estimating parameters, this means that the same topic indices describe 
both patterns of recurring words and patterns in the links. The results in 
Section 4.1 show that this provides a superior prediction mechanism. 

We explore four specific possibilities for the link probability function. 
First, we consider 

(2.1)  $\psi_\sigma(y = 1) = \sigma(\eta^T(\bar z_d \circ \bar z_{d'}) + \nu)$,

where $\bar z_d = \frac{1}{N_d}\sum_n z_{d,n}$, the $\circ$ notation denotes the Hadamard (element-wise)
product, and the function $\sigma$ is the sigmoid. This link function models each
per-pair binary variable as a logistic regression with hidden covariates. It
is parameterized by coefficients $\eta$ and intercept $\nu$. The covariates are con-
structed by the Hadamard product of $\bar z_d$ and $\bar z_{d'}$, which captures similarity
between the hidden topic representations of the two documents.




Fig. 2. A two-document segment of the RTM. The variable $y_{d,d'}$ indicates whether the
two documents are linked. The complete model contains this variable for each pair of doc-
uments. This binary variable is generated contingent on the topic assignments for the
participating documents, $z_d$ and $z_{d'}$, and global regression parameters $\eta$. The plates in-
dicate replication. This model captures both the words and the link structure of the data
shown in Figure 1.
Second, we consider

(2.2)  $\psi_e(y = 1) = \exp(\eta^T(\bar z_d \circ \bar z_{d'}) + \nu)$.

Here, $\psi_e$ uses the same covariates as $\psi_\sigma$, but has an exponential mean func-
tion instead. Rather than tapering off when $\bar z_d$ and $\bar z_{d'}$ are close (i.e., when
their weighted inner product, $\eta^T(\bar z_d \circ \bar z_{d'})$, is large), the probabilities re-
turned by this function continue to increase exponentially. With some alge-
braic manipulation, the function $\psi_e$ can be viewed as an approximate variant
of the modeling methodology presented in Blei and Jordan (2003).
Third, we consider 

(2.3)  $\psi_\Phi(y = 1) = \Phi(\eta^T(\bar z_d \circ \bar z_{d'}) + \nu)$,

where $\Phi$ represents the cumulative distribution function of the Normal dis-
tribution. Like $\psi_\sigma$, this link function models the link response as a regression
parameterized by coefficients $\eta$ and intercept $\nu$. The covariates are also con-
structed by the Hadamard product of $\bar z_d$ and $\bar z_{d'}$, but instead of the logit
model hypothesized by $\psi_\sigma$, $\psi_\Phi$ models the link probability with a probit
model.

Finally, we consider 

(2.4)  $\psi_N(y = 1) = \exp(-\eta^T(\bar z_d - \bar z_{d'}) \circ (\bar z_d - \bar z_{d'}) - \nu)$.

Note that $\psi_N$ is the only one of the link probability functions which is not
a function of $\bar z_d \circ \bar z_{d'}$. Instead, it depends on a weighted squared Euclidean
difference between the two latent topic assignment distributions. Specifically,
it is the multivariate Gaussian density function, with mean 0 and diagonal
covariance characterized by $\eta$, applied to $\bar z_d - \bar z_{d'}$. Because the range of
$\bar z_d - \bar z_{d'}$ is finite, the probability of a link, $\psi_N(y = 1)$, is also finite. We
constrain the parameters $\eta$ and $\nu$ to ensure that it is between zero and one.

All four of the $\psi$ functions we consider are plotted in Figure 3. The link
likelihoods suggested by the link probability functions are plotted against the
inner product of $\bar z_d$ and $\bar z_{d'}$. The parameters of the link probability functions
were chosen to ensure that all curves have the same endpoints. Both $\psi_\sigma$ and
$\psi_\Phi$ have similar sigmoidal shapes. In contrast, $\psi_e$ is exponential in shape
and its slope remains large at the right limit. The one-sided Gaussian form
of $\psi_N$ is also apparent.
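
To make the four definitions concrete, the following Python sketch evaluates each link probability function on a pair of mean topic assignment vectors. The function names and the parameter values are illustrative assumptions, not taken from the paper; the parameters are picked so that every function stays within [0, 1] on the inputs shown.

```python
import math


def weighted_hadamard(eta, nu, z_bar_d, z_bar_e):
    """eta^T (z_bar_d o z_bar_e) + nu, the shared covariate of (2.1)-(2.3)."""
    return nu + sum(h * a * b for h, a, b in zip(eta, z_bar_d, z_bar_e))


def psi_sigma(eta, nu, z_bar_d, z_bar_e):
    """Equation (2.1): logistic link."""
    return 1.0 / (1.0 + math.exp(-weighted_hadamard(eta, nu, z_bar_d, z_bar_e)))


def psi_e(eta, nu, z_bar_d, z_bar_e):
    """Equation (2.2): exponential link (needs a non-positive score)."""
    return math.exp(weighted_hadamard(eta, nu, z_bar_d, z_bar_e))


def psi_phi(eta, nu, z_bar_d, z_bar_e):
    """Equation (2.3): probit link, using the standard Normal CDF."""
    x = weighted_hadamard(eta, nu, z_bar_d, z_bar_e)
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def psi_n(eta, nu, z_bar_d, z_bar_e):
    """Equation (2.4): weighted squared difference of the two vectors."""
    sq = sum(h * (a - b) ** 2 for h, a, b in zip(eta, z_bar_d, z_bar_e))
    return math.exp(-sq - nu)


same_topic = ([1.0, 0.0], [1.0, 0.0])
different_topic = ([1.0, 0.0], [0.0, 1.0])
eta = [2.0, 2.0]

p_sigma_close = psi_sigma(eta, -2.0, *same_topic)
p_sigma_far = psi_sigma(eta, -2.0, *different_topic)
p_e_close = psi_e(eta, -2.0, *same_topic)
p_phi_close = psi_phi(eta, -2.0, *same_topic)
p_n_close = psi_n(eta, 0.0, *same_topic)
p_n_far = psi_n(eta, 0.0, *different_topic)
```

In this toy setting every function assigns a higher link probability to the pair of documents that share a topic than to the pair that does not, and the exponential link reaches exactly one at the right endpoint because the score there is zero.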

3. Inference, estimation and prediction. With the model defined, we 
turn to approximate posterior inference, parameter estimation, and predic- 
tion. We develop a variational inference procedure for approximating the 
posterior. We use this procedure in a variational expectation-maximization 
(EM) algorithm for parameter estimation. Finally, we show how a model 
whose parameters have been estimated can be used as a predictive model of 
words and links. 
Fig. 3. A comparison of different link probability functions. The plot shows the probability
of two documents being linked as a function of their similarity (as measured by the inner
product of the two documents' latent topic assignments). All link probability functions were
parameterized so as to have the same endpoints.



3.1. Inference. The goal of posterior inference is to compute the poste- 
rior distribution of the latent variables conditioned on the observations. As 
with many hierarchical Bayesian models of interest, exact posterior inference 
is intractable and we appeal to approximate inference methods. Most previ- 
ous work on latent variable network modeling has employed Markov Chain 
Monte Carlo (MCMC) sampling methods to approximate the posterior of in- 
terest [Hoff, Raftery and Handcock (2002); Kemp, Griffiths and Tenenbaum 
(2004)]. Here, we employ variational inference [Jordan et al. (1999); Wain- 
wright and Jordan (2005)], a deterministic alternative to MCMC sampling 
that has been shown to give comparable accuracy to MCMC with improved
computational efficiency [Blei and Jordan (2006); Braun and McAuliffe (2007)]. 
Wainwright and Jordan (2008) investigate the properties of variational ap- 
proximations in detail. Recently, variational methods have been employed in 
other latent variable network models [Hofman and Wiggins (2007); Airoldi 
et al. (2008)]. 

In variational methods, we posit a family of distributions over the latent 
variables, indexed by free variational parameters. Those parameters are then 
fit to be close to the true posterior, where closeness is measured by relative 
entropy. For the RTM, we use the fully-factorized family, where the topic 
proportions and all topic assignments are considered independent, 



(3.1)  $q(\Theta, Z \mid \gamma, \Phi) = \prod_d \Big[ q_\theta(\theta_d \mid \gamma_d) \prod_n q_z(z_{d,n} \mid \phi_{d,n}) \Big]$.



The parameters $\gamma$ are variational Dirichlet parameters, one for each docu-
ment, and $\Phi$ are variational multinomial parameters, one for each word in
each document. Note that $E_q[z_{d,n}] = \phi_{d,n}$.
Minimizing the relative entropy is equivalent to maximizing Jensen's
lower bound on the marginal probability of the observations, that is, the
evidence lower bound (ELBO),

(3.2)  $\mathcal{L} = \sum_{(d_1,d_2)} E_q[\log p(y_{d_1,d_2} \mid z_{d_1}, z_{d_2}, \eta, \nu)] + \sum_d \sum_n E_q[\log p(z_{d,n} \mid \theta_d)] + \sum_d \sum_n E_q[\log p(w_{d,n} \mid \beta_{1:K}, z_{d,n})] + \sum_d E_q[\log p(\theta_d \mid \alpha)] + H(q)$,

where $(d_1, d_2)$ denotes all document pairs and $H(q)$ denotes the entropy
of the distribution q. The first term of the ELBO differentiates the RTM
from LDA [Blei, Ng and Jordan (2003)]. The connections between docu- 
ments affect the objective in approximate posterior inference (and, below, 
in parameter estimation). 

We develop the inference procedure below under the assumption that only 
observed links will be modeled (i.e., $y_{d_1,d_2}$ is either 1 or unobserved). We
do this for both methodological and computational reasons. 

First, while one can fix $y_{d_1,d_2} = 1$ whenever a link is observed between
$d_1$ and $d_2$ and set $y_{d_1,d_2} = 0$ otherwise, this approach is inappropriate in
corpora where the absence of a link cannot be construed as evidence for
$y_{d_1,d_2} = 0$. In these cases, treating these links as unobserved variables is
more faithful to the underlying semantics of the data. For example, in large 
social networks such as Facebook the absence of a link between two people 
does not necessarily mean that they are not friends; they may be real friends 
who are unaware of each other's existence in the network. Treating this link 
as unobserved better respects our lack of knowledge about the status of their 
relationship. 

Second, treating nonlinks as hidden decreases the computational
cost of inference; since the link variables are leaves in the graphical model, 
they can be removed whenever they are unobserved. Thus, the complexity of 
computation scales linearly with the number of observed links rather than 
the number of document pairs. When the number of true observations is 
sparse relative to the number of document pairs, as is typical, this provides 
a significant computational advantage. For example, on the Cora data set 
described in Section 4, there are 3,665,278 unique document pairs but only 
5278 observed links. Treating nonlinks as hidden in this case leads to an 
inference procedure which is nearly 700 times faster. 
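
The Cora figures quoted above can be checked with a short calculation. The sketch below assumes only the document and link counts reported for Cora; the function name is illustrative.

```python
def count_unordered_pairs(num_documents):
    """Number of distinct unordered document pairs, D * (D - 1) / 2."""
    return num_documents * (num_documents - 1) // 2


cora_documents = 2708   # documents in the processed Cora data
cora_links = 5278       # observed links in the processed Cora data

cora_pairs = count_unordered_pairs(cora_documents)
pairs_per_link = cora_pairs / cora_links
```

Here `cora_pairs` evaluates to 3,665,278 and `pairs_per_link` to roughly 694, which is the source of the "nearly 700 times" figure: the per-link work is done once per observed link rather than once per document pair.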



4 Sums over document pairs $(d_1, d_2)$ are understood to range over pairs for which a link
has been observed.
Our aim now is to compute each term of the objective function given in
equation (3.2). The first term,

(3.3)  $\sum_{(d_1,d_2)} \mathcal{L}_{d_1,d_2} \equiv \sum_{(d_1,d_2)} E_q[\log p(y_{d_1,d_2} \mid z_{d_1}, z_{d_2}, \eta, \nu)]$,

depends on our choice of link probability function. For many link probability 
functions, this term cannot be expanded analytically. However, if the link 
probability function depends only on $\bar z_{d_1} \circ \bar z_{d_2}$, we can expand the expec-
tation using the following first-order approximation [Braun and McAuliffe
(2007)] 5 :

$\mathcal{L}_{d_1,d_2} = E_q[\log \psi(\bar z_{d_1} \circ \bar z_{d_2})] \approx \log \psi(E_q[\bar z_{d_1} \circ \bar z_{d_2}]) = \log \psi(\bar\pi_{d_1,d_2})$,

where $\bar\pi_{d_1,d_2} = \bar\phi_{d_1} \circ \bar\phi_{d_2}$ and $\bar\phi_d = E_q[\bar z_d] = \frac{1}{N_d}\sum_n \phi_{d,n}$. In this work, we
explore three functions which can be written in this form,

(3.4)
$E_q[\log \psi_\sigma(\bar z_{d_1} \circ \bar z_{d_2})] \approx \log \sigma(\eta^T \bar\pi_{d_1,d_2} + \nu)$,
$E_q[\log \psi_\Phi(\bar z_{d_1} \circ \bar z_{d_2})] \approx \log \Phi(\eta^T \bar\pi_{d_1,d_2} + \nu)$,
$E_q[\log \psi_e(\bar z_{d_1} \circ \bar z_{d_2})] = \eta^T \bar\pi_{d_1,d_2} + \nu$.

Note that for $\psi_e$ the expression is exact. The likelihood when $\psi_N$ is chosen
as the link probability function can also be computed exactly,

$E_q[\log \psi_N(\bar z_{d_1}, \bar z_{d_2})] = -\nu - \sum_i \eta_i \big( (\bar\phi_{d_1,i} - \bar\phi_{d_2,i})^2 + \mathrm{Var}(\bar z_{d_1,i}) + \mathrm{Var}(\bar z_{d_2,i}) \big)$,

where $\bar z_{d,i}$ denotes the ith element of the mean topic assignment vector, $\bar z_d$,
and $\mathrm{Var}(\bar z_{d,i}) = \frac{1}{N_d^2} \sum_n \phi_{d,n,i}(1 - \phi_{d,n,i})$, where $\phi_{d,n,i}$ is the ith element of the
multinomial parameter $\phi_{d,n}$. (See Appendix A.)
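
The exact expression for the $\psi_N$ expectation can be sanity-checked numerically. The sketch below compares the closed form against a brute-force expectation that enumerates every joint topic assignment of two tiny documents under the factorized variational distribution. The function names and the toy values of $\eta$, $\nu$ and $\phi$ are assumptions made for illustration only.

```python
import itertools


def mean_phi(phi):
    """phi-bar: average of the per-word multinomial parameters."""
    n, k = len(phi), len(phi[0])
    return [sum(p[i] for p in phi) / n for i in range(k)]


def var_z_bar(phi):
    """Var(z-bar_i) = (1 / N^2) * sum_n phi_{n,i} * (1 - phi_{n,i})."""
    n, k = len(phi), len(phi[0])
    return [sum(p[i] * (1.0 - p[i]) for p in phi) / n ** 2 for i in range(k)]


def expected_log_psi_n(eta, nu, phi_1, phi_2):
    """Closed-form E_q[log psi_N] from the variational parameters."""
    m1, m2 = mean_phi(phi_1), mean_phi(phi_2)
    v1, v2 = var_z_bar(phi_1), var_z_bar(phi_2)
    return -nu - sum(
        h * ((a - b) ** 2 + va + vb)
        for h, a, b, va, vb in zip(eta, m1, m2, v1, v2)
    )


def assignments_with_prob(phi):
    """Yield (z-bar, probability) for every joint assignment under q."""
    k = len(phi[0])
    for z in itertools.product(range(k), repeat=len(phi)):
        prob = 1.0
        for p, topic in zip(phi, z):
            prob *= p[topic]
        yield [z.count(i) / len(z) for i in range(k)], prob


def brute_force_expected_log_psi_n(eta, nu, phi_1, phi_2):
    """Same expectation, computed by summing over all assignments."""
    total = 0.0
    for zb1, p1 in assignments_with_prob(phi_1):
        for zb2, p2 in assignments_with_prob(phi_2):
            sq = sum(h * (a - b) ** 2 for h, a, b in zip(eta, zb1, zb2))
            total += p1 * p2 * (-nu - sq)
    return total


eta, nu = [1.5, 0.5], 0.1
phi_1 = [[0.7, 0.3], [0.2, 0.8], [0.5, 0.5]]
phi_2 = [[0.9, 0.1], [0.4, 0.6]]
closed_form = expected_log_psi_n(eta, nu, phi_1, phi_2)
enumerated = brute_force_expected_log_psi_n(eta, nu, phi_1, phi_2)
```

The two values agree to floating-point precision because the two documents are independent under the factorized family, so the expected squared difference splits into the squared difference of the means plus the two variances.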

Leveraging these expanded expectations, we then use coordinate ascent
to optimize the ELBO with respect to the variational parameters $\gamma$, $\Phi$. This
yields an approximation to the true posterior. The update for the variational
multinomial $\phi_{d,j}$ is

(3.5)  $\phi_{d,j} \propto \exp\Big\{ \sum_{d' \neq d} \nabla_{\phi_{d,n}} \mathcal{L}_{d,d'} + E_q[\log \theta_d \mid \gamma_d] + \log \beta_{\cdot,w_{d,j}} \Big\}$.

The contribution to the update from link information, $\nabla_{\phi_{d,n}} \mathcal{L}_{d,d'}$, depends
on the choice of link probability function. For the link probability functions
expanded in equation (3.4), this term can be written as

(3.6)  $\nabla_{\phi_{d,n}} \mathcal{L}_{d,d'} = (\nabla_{\bar\pi_{d,d'}} \mathcal{L}_{d,d'}) \circ \frac{\bar\phi_{d'}}{N_d}$.

5 While we do not give a detailed proof here, the error of a first-order approximation
is closely related to the probability mass in the tails of the distribution on $\bar z_{d_1}$ and $\bar z_{d_2}$.
Because the number of words in a document is typically large, the variance of $\bar z_{d_1}$ and $\bar z_{d_2}$
tends to be small, making the first-order approximation a good one.

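Intuitively, equation (3.6) will cause a document's latent topic assignments
to be nudged in the direction of neighboring documents' latent topic as-
signments. The magnitude of this pull depends only on $\bar\pi_{d,d'}$, that is, some
measure of how close they are already. The corresponding gradients for the
functions in equation (3.4) are

$\nabla_{\bar\pi_{d,d'}} \mathcal{L}^{\sigma}_{d,d'} \approx (1 - \sigma(\eta^T \bar\pi_{d,d'} + \nu))\,\eta$,
$\nabla_{\bar\pi_{d,d'}} \mathcal{L}^{\Phi}_{d,d'} \approx \frac{\Phi'(\eta^T \bar\pi_{d,d'} + \nu)}{\Phi(\eta^T \bar\pi_{d,d'} + \nu)}\,\eta$,
$\nabla_{\bar\pi_{d,d'}} \mathcal{L}^{e}_{d,d'} = \eta$.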

The gradient when $\psi_N$ is the link probability function is

(3.7)  $\nabla_{\phi_{d,n}} \mathcal{L}^{N}_{d,d'} = \frac{2}{N_d}\,\eta \circ \Big( \bar\phi_{d'} - \bar\phi_{d,-n} - \frac{1}{2N_d} \Big)$,

where $\bar\phi_{d,-n} = \bar\phi_d - \frac{1}{N_d}\phi_{d,n}$. Similar in spirit to equation (3.6), equation (3.7)
will cause a document's latent topic assignments to be drawn toward those
of its neighbors. This draw is tempered by $\bar\phi_{d,-n}$, a measure of how similar
the current document is to its neighbors.

The contribution to the update in equation (3.5) from the word evidence
$\log \beta_{\cdot,w_{d,j}}$ can be computed by taking the element-wise logarithm of the
$w_{d,j}$th column of the topic matrix $\beta$. The contribution to the update from
the document's latent topic proportions is given by

$E_q[\log \theta_d \mid \gamma_d] = \Psi(\gamma_d) - \Psi\big(\sum_i \gamma_{d,i}\big)$,

where $\Psi$ is the digamma function. (A digamma of a vector is the vector of
digammas.) The update for $\gamma$ is identical to that in variational inference for
LDA [Blei, Ng and Jordan (2003)],

$\gamma_d \leftarrow \alpha + \sum_n \phi_{d,n}$.

These updates are fully derived in Appendix A. 
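
As a rough illustration of how these pieces fit together, the sketch below performs one coordinate-ascent update of a single word's variational multinomial under the logistic link function, followed by the $\gamma$ update. It is a simplified sketch under stated assumptions, not the authors' code: all function names and toy values are hypothetical, the digamma function is approximated by a recurrence plus an asymptotic series, only one neighboring document is considered (with several neighbors the link terms would be summed), and the topic matrix is assumed to have no zero entries.

```python
import math


def digamma(x):
    """Digamma via recurrence and an asymptotic series (approximation)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series


def link_term_sigma(eta, nu, phi_bar_d, phi_bar_nbr, n_words_d):
    """Equation (3.6) with the logistic gradient, for one neighbor."""
    score = nu + sum(h * a * b for h, a, b in zip(eta, phi_bar_d, phi_bar_nbr))
    weight = 1.0 - 1.0 / (1.0 + math.exp(-score))
    return [weight * h * b / n_words_d for h, b in zip(eta, phi_bar_nbr)]


def update_phi(word, gamma_d, beta, link_term):
    """Equation (3.5) for one word; returns a normalized topic vector."""
    psi_total = digamma(sum(gamma_d))
    scores = [
        link_term[k] + digamma(gamma_d[k]) - psi_total + math.log(beta[k][word])
        for k in range(len(gamma_d))
    ]
    top = max(scores)  # subtract the max before exponentiating, for stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def update_gamma(alpha, phi_d):
    """gamma_d = alpha + sum_n phi_{d,n}, as in variational LDA."""
    return [a + sum(p[k] for p in phi_d) for k, a in enumerate(alpha)]


beta = [[0.5, 0.5], [0.5, 0.5]]      # uninformative topics, two-word vocabulary
gamma_d = [1.0, 1.0]
no_link = update_phi(0, gamma_d, beta, [0.0, 0.0])
pull = link_term_sigma([4.0, 4.0], -1.0, [0.5, 0.5], [0.9, 0.1], n_words_d=2)
with_link = update_phi(0, gamma_d, beta, pull)
new_gamma = update_gamma([0.5, 0.5], [[1.0, 0.0], [0.5, 0.5]])
```

With uninformative topics and a symmetric $\gamma$, the update without link information is uniform, while the linked neighbor that concentrates on the first topic shifts the word's assignment toward that topic.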



3.2. Parameter estimation. We fit the model by finding maximum like-
lihood estimates for each of the parameters: multinomial topic vectors $\beta_{1:K}$
and link function parameters $\eta$, $\nu$. Once again, this is intractable so we
turn to an approximation. We employ variational expectation-maximization, 
where we iterate between optimizing the ELBO of equation (3.2) with re-
spect to the variational distribution and with respect to the model param-
eters. This is equivalent to the usual expectation-maximization algorithm 
[Dempster, Laird and Rubin (1977)], except that the computation of the 
posterior is replaced by variational inference. 

Optimizing with respect to the variational distribution is described in 
Section 3.1. Optimizing with respect to the model parameters is equivalent 
to maximum likelihood estimation with expected sufficient statistics, where 
the expectation is taken with respect to the variational distribution. 

The update for the topics matrix $\beta$ is

(3.8)  $\beta_{k,w} \propto \sum_d \sum_n 1(w_{d,n} = w)\,\phi_{d,n,k}$.

This is the same as the variational EM update for LDA [Blei, Ng and Jordan 
(2003)]. In practice, we smooth our estimates of $\beta_{k,w}$ using pseudocount
smoothing [Jurafsky and Martin (2008)] which helps to prevent overfitting
by positing a Dirichlet prior on $\beta_k$.
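
A minimal Python sketch of the topic update with pseudocount smoothing follows. The function name, the data layout and the pseudocount value are assumptions for illustration; the paper does not specify the smoothing constant here.

```python
def update_topics(docs, phi, num_topics, vocab_size, pseudocount=0.01):
    """Equation (3.8) with additive smoothing (hypothetical helper).

    docs[d][n] is the vocabulary index of word n in document d, and
    phi[d][n][k] is that word's variational probability of topic k.
    """
    counts = [[pseudocount] * vocab_size for _ in range(num_topics)]
    for words, phi_d in zip(docs, phi):
        for word, phi_dn in zip(words, phi_d):
            for k in range(num_topics):
                counts[k][word] += phi_dn[k]   # expected count of (k, word)
    beta = []
    for row in counts:
        total = sum(row)
        beta.append([c / total for c in row])  # normalize each topic
    return beta


docs = [[0, 1]]
phi = [[[1.0, 0.0], [0.0, 1.0]]]
beta = update_topics(docs, phi, num_topics=2, vocab_size=2, pseudocount=1.0)
```

In this toy case each topic has one expected occurrence of its own word plus a pseudocount of one for every vocabulary entry, so the first topic becomes (2/3, 1/3) and the second (1/3, 2/3). The pseudocount keeps every entry strictly positive, which also keeps the logarithm of the topic matrix finite in the word-evidence term of equation (3.5).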

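In order to fit the parameters $\eta$, $\nu$ of the logistic function of equation (2.1),
we employ gradient-based optimization. Using the approximation described
in equation (3.4), we compute the gradient of the objective given in equa-
tion (3.2) with respect to these parameters,

$\nabla_\eta \mathcal{L} \approx \sum_{(d_1,d_2)} [y_{d_1,d_2} - \sigma(\eta^T \bar\pi_{d_1,d_2} + \nu)]\,\bar\pi_{d_1,d_2}$,
$\frac{\partial}{\partial \nu} \mathcal{L} \approx \sum_{(d_1,d_2)} [y_{d_1,d_2} - \sigma(\eta^T \bar\pi_{d_1,d_2} + \nu)]$.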

Note that these gradients cannot be used to directly optimize the pa-
rameters of the link probability function without negative observations (i.e.,
$y_{d_1,d_2} = 0$). We address this by applying a regularization penalty. This reg-
ularization penalty along with parameter update procedures for the other
link probability functions are given in Appendix B. 
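
The following sketch computes these gradients for the logistic link function when every observed link has $y = 1$, and checks the intercept gradient against a finite difference. The function names and toy values are assumptions; the regularization penalty of Appendix B is deliberately not included, since its form is not given in this section.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def score(eta, nu, pi_bar):
    return nu + sum(h * p for h, p in zip(eta, pi_bar))


def link_objective(eta, nu, pi_bars):
    """Sum of log sigma(eta^T pi_bar + nu) over observed links (all y = 1)."""
    return sum(math.log(sigmoid(score(eta, nu, pi))) for pi in pi_bars)


def link_gradients(eta, nu, pi_bars):
    """Gradients of the objective above with respect to eta and nu."""
    grad_eta = [0.0] * len(eta)
    grad_nu = 0.0
    for pi in pi_bars:
        residual = 1.0 - sigmoid(score(eta, nu, pi))   # y - sigma(.), y = 1
        for k, p in enumerate(pi):
            grad_eta[k] += residual * p
        grad_nu += residual
    return grad_eta, grad_nu


eta, nu = [1.0, -0.5], 0.2
pi_bars = [[0.3, 0.1], [0.05, 0.4], [0.2, 0.2]]
grad_eta, grad_nu = link_gradients(eta, nu, pi_bars)

step = 1e-6
numeric_grad_nu = (
    link_objective(eta, nu + step, pi_bars)
    - link_objective(eta, nu - step, pi_bars)
) / (2.0 * step)
```

Because every residual is strictly positive when only positive observations are present, the intercept gradient never reaches zero, so unregularized ascent would push the intercept upward without bound. This illustrates why some form of penalty or negative evidence is needed.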

3.3. Prediction. With a fitted model, our ultimate goal is to make pre- 
dictions about new data. We describe two kinds of prediction: link prediction 
from words and word prediction from links. 

In link prediction, we are given a new document (i.e., a document which 
is not in the training set) and its words. We are asked to predict its links to 
the other documents. This requires computing 

$p(y_{d,d'} \mid w_d, w_{d'}) = \sum_{z_d, z_{d'}} p(y_{d,d'} \mid \bar z_d, \bar z_{d'})\, p(z_d, z_{d'} \mid w_d, w_{d'})$,

an expectation with respect to a posterior that we cannot compute. Using the
inference algorithm from Section 3.1, we find variational parameters which 
optimize the ELBO for the given evidence, that is, the words and links 
for the training documents and the words in the test document. Replacing 
the posterior with this approximation $q(\Theta, Z)$, the predictive probability is
approximated with

(3.9)  $p(y_{d,d'} \mid w_d, w_{d'}) \approx E_q[p(y_{d,d'} \mid \bar z_d, \bar z_{d'})]$.

In a variant of link prediction, we are given a new set of documents (doc- 
uments not in the training set) along with their words and asked to select 
the links most likely to exist. The predictive probability for this task is 
proportional to equation (3.9). 
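
As an illustration of link prediction, the sketch below ranks training documents as candidate links for a new document. It makes two assumptions that go beyond the text: the logistic link function is used, and the expectation in equation (3.9) is replaced by the link function evaluated at the variational means, in the spirit of the first-order approximation of Section 3.1. The function names, document identifiers and numbers are hypothetical.

```python
import math


def link_probability(phi_bar_d, phi_bar_e, eta, nu):
    """sigma(eta^T (phi_bar_d o phi_bar_e) + nu), a plug-in approximation."""
    score = nu + sum(h * a * b for h, a, b in zip(eta, phi_bar_d, phi_bar_e))
    return 1.0 / (1.0 + math.exp(-score))


def rank_candidate_links(phi_bar_new, phi_bar_train, eta, nu):
    """Sort training documents by approximate link probability, highest first."""
    scored = [
        (doc_id, link_probability(phi_bar_new, phi_bar, eta, nu))
        for doc_id, phi_bar in phi_bar_train.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


training = {"doc_a": [0.9, 0.1], "doc_b": [0.5, 0.5], "doc_c": [0.1, 0.9]}
ranking = rank_candidate_links([0.8, 0.2], training, eta=[4.0, 4.0], nu=-2.0)
```

The new document concentrates on the first topic, so the training document with the most similar topic profile is ranked first and the most dissimilar one last.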

The second predictive task is word prediction, where we predict the words 
of a new document based only on its links. As with link prediction, $p(w_{d,i} \mid y_d)$
cannot be computed. Using the same technique, a variational distribution
can approximate this posterior. This yields the predictive probability

$p(w_{d,i} \mid y_d) \approx E_q[p(w_{d,i} \mid z_{d,i})]$.

Note that models which treat the endpoints of links as discrete obser- 
vations of data indices cannot participate in the two tasks presented here. 
They cannot make meaningful predictions for documents that do not ap- 
pear in the training set [Cohn and Hofmann (2001); Erosheva, Fienberg and 
Lafferty (2004); Nallapati and Cohen (2008); Sinkkonen, Aukia and Kaski 
(2008)]. By modeling both documents and links generatively, our model is 
able to give predictive distributions for words given links, links given words, 
or any mixture thereof. 

4. Empirical results. We examined the RTM on four data sets. 6 Words 
were stemmed; stop words, that is, words like "and," "of," or "but," and 
infrequently occurring words were removed. Directed links were converted 
to undirected links 7 and documents with no links were removed. The Cora 
data [McCallum et al. (2000)] contains abstracts from the Cora computer 
science research paper search engine, with links between documents that 
cite each other. The WebKB data [Craven et al. (1998)] contains web pages 
from the computer science departments of different universities, with links 
determined from the hyperlinks on each page. The PNAS data contains 
recent abstracts from the Proceedings of the National Academy of Sciences. 
The links between documents are intra-PNAS citations. The LocalNews data



6 An implementation of the RTM with accompanying data can be found at
http://cran.r-project.org/web/packages/lda/.

7 The RTM can be extended to accommodate directed connections. Here we modeled 
undirected links. 
Table 1
Summary statistics for the four data sets after processing

Data set     # of documents   # of words   Number of links   Lexicon size
Cora              2708           49216          5278             1433
WebKB              877           79365          1388             1703
PNAS              2218          119162          1577             2239
LocalNews           51           93765           107             1242

set is a corpus of local news culled from various media markets throughout 
the United States. We create one bag-of-words document associated with 
each state (including the District of Columbia); each state's "document" 
consists of headlines and summaries from local news in that state's media 
markets. Links between states were determined by geographical adjacency. 
Summary statistics for these data sets are given in Table 1. 

4.1. Evaluating the predictive distribution. As with any probabilistic 
model, the RTM defines a probability distribution over unseen data. Af- 
ter inferring the latent variables from data (as described in Section 3.1), 
we ask how well the model predicts the links and words of unseen nodes. 
Models that give higher probability to the unseen documents better capture 
the joint structure of words and links. 

We study the RTM with three link probability functions discussed above: 
the logistic link probability function, $\psi_\sigma$, of equation (2.1); the exponential 
link probability function, $\psi_e$, of equation (2.2); and the probit link probability 
function, $\psi_\Phi$, of equation (2.3). We compare these models against two 
alternative approaches. 

The first ("Pairwise Link-LDA") is the model proposed by Nallapati et 
al. (2008), which is an extension of the mixed membership stochastic block 
model [Airoldi et al. (2008)] to model network structure and node attributes. 
This model posits that each link is generated as a function of two individ- 
ual topics, drawn from the topic proportions vectors associated with the 
endpoints of the link. Because latent topics for words and links are drawn 
independently in this model, it cannot ensure that the discovered topics 
are representative of both words and links simultaneously. This model also 
introduces extra variational parameters for every link, which adds 
computational complexity. 

The second ("LDA + Regression") first fits an LDA model to the docu- 
ments and then fits a logistic regression model to the observed links, with 
input given by the Hadamard product of the latent class distributions of 
each pair of documents. Rather than performing dimensionality reduction 
and regression simultaneously, this method performs unsupervised dimen- 
sionality reduction first, and then regresses to understand the relationship 

Fig. 4. Average held-out predictive link rank (left) and word rank (right) as a function 
of the number of topics. Lower is better. For all three corpora, RTMs outperform baseline 
unigram, LDA and "Pairwise Link-LDA" [Nallapati et al. (2008)]. 



between the latent space and underlying link structure. All models were fit 
such that the total mass of the Dirichlet hyperparameter $\alpha$ was 1.0. (While 
we omit a full sensitivity study here, we observed that the performance of 
the models was similar for $\alpha$ within a factor of 2 above and below the value 
we chose.) 
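
As an illustration of the "LDA + Regression" baseline, the sketch below builds the Hadamard-product input for a pair of documents and passes it through a logistic function. It is a sketch under stated assumptions: the topic proportions are taken as given (from a previously fitted LDA model), the regression weights and intercept are hypothetical placeholders, and the fitting of the regression itself is omitted.

```python
import math


def hadamard(theta_a, theta_b):
    """Element-wise product of two topic-proportion vectors."""
    return [a * b for a, b in zip(theta_a, theta_b)]


def link_probability(theta_a, theta_b, weights, intercept):
    """Logistic regression score on the Hadamard-product features."""
    features = hadamard(theta_a, theta_b)
    score = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical topic proportions and regression parameters.
theta_1 = [0.5, 0.5]
theta_2 = [0.2, 0.8]
features = hadamard(theta_1, theta_2)
p_link = link_probability(theta_1, theta_2, weights=[0.0, 0.0], intercept=0.0)
```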

We measured the performance of these models on link prediction and 
word prediction (see Section 3.3). We divided the Cora, WebKB and PNAS 
data sets each into five folds. For each fold and for each model, we ask two 
predictive queries: given the words of a new document, how probable are its 
links; and given the links of a new document, how probable are its words? 
Again, the predictive queries are for completely new test documents that are 
not observed in training. During training the test documents are removed 
along with their attendant links. We show the results for both tasks in terms 
of predictive rank as a function of the number of topics in Figure 4. (See 
Section 5 for a discussion on potential approaches for selecting the number of 
topics and the Dirichlet hyperparameter $\alpha$.) Here we follow the convention 
that lower predictive rank is better. 
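
To make the evaluation measure concrete, the sketch below computes an average link rank for one held-out document. It assumes that predictive rank is the average position of the document's true links when all candidate documents are sorted by decreasing predicted link probability; the precise definition is given in Section 3.3, and the function name and toy scores here are hypothetical.

```python
def average_link_rank(scores, true_links):
    """scores: dict mapping candidate id -> predicted link probability.
    true_links: set of candidate ids actually linked to the held-out document.
    Returns the mean 1-based rank of the true links (lower is better)."""
    ordering = sorted(scores, key=lambda c: scores[c], reverse=True)
    position = {c: i + 1 for i, c in enumerate(ordering)}
    ranks = [position[c] for c in true_links]
    return sum(ranks) / len(ranks)


scores = {"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.3}
rank = average_link_rank(scores, {"a", "d"})
```

Here the true links "a" and "d" are ranked first and third, so the average rank is 2.0.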

In predicting links, the three variants of the RTM perform better than all 
of the alternative models for all of the data sets (see Figure 4, left column). 
Cora is paradigmatic, showing a nearly 40% improvement in predictive rank 
over baseline and 25% improvement over LDA + Regression. The perfor- 
mance for the RTM on this task is similar for all three link probability 
functions. We emphasize that the links are predicted to documents seen in 
the training set from documents which were held out. By incorporating link 
and node information in a joint fashion, the model is able to generalize to 
new documents for which no link information was previously known. 

Note that the performance of the RTM on link prediction generally in- 
creases as the number of topics is increased (there is a slight decrease on 
WebKB). In contrast, the performance of the Pairwise Link-LDA worsens as 
the number of topics is increased. This is most evident on Cora, where Pair- 
wise Link-LDA is competitive with RTM at five topics, but the predictive 
link rank monotonically increases after that despite its increased dimension- 
ality (and commensurate increase in computational difficulty). We hypoth- 
esize that Pairwise Link-LDA exhibits this behavior because it uses some 
topics to explain the words observed in the training set, and other topics to 
explain the links observed in the training set. This problem is exacerbated 
as the number of topics is increased, making it less effective at predicting 
links from word observations. 

In predicting words, the three variants of the RTM again outperform all 
of the alternative models (see Figure 4, right column). This is because the 
RTM uses link information to influence the predictive distribution of words. 
In contrast, the predictions of LDA + Regression and Pairwise Link-LDA 
barely use link information; thus, they give predictions that are independent 
of the number of topics and similar to those made by a simple unigram model. 

4.2. Automatic link suggestion. A natural real-world application of link 
prediction is to suggest links to a user based on the text of a document. 
One might suggest citations for an abstract or friends for a user in a social 
network. 

As a complement to the quantitative evaluation of link prediction given 
in the previous section, Table 2 illustrates suggested citations using RTM 

Table 2 

Top eight link predictions made by RTM ($\psi_e$) and LDA + Regression for two documents 
(italicized) from Cora. The models were fit with 10 topics. Boldfaced titles indicate actual 
documents cited by or citing each document. Over the whole corpus, RTM improves 
precision over LDA + Regression by 80% when evaluated on the first 20 documents 
retrieved 

Markov chain Monte Carlo convergence diagnostics: A comparative review 

  RTM ($\psi_e$): 
    Minorization conditions and convergence rates for Markov chain Monte Carlo 
    Rates of convergence of the Hastings and Metropolis algorithms 
    Possible biases induced by MCMC convergence diagnostics 
    Bounding convergence time of the Gibbs sampler in Bayesian image restoration 
    Self regenerative Markov chain Monte Carlo 
    Auxiliary variable methods for Markov chain Monte Carlo with applications 
    Rate of Convergence of the Gibbs Sampler by Gaussian Approximation 
    Diagnosing convergence of Markov chain Monte Carlo algorithms 

  LDA + Regression: 
    Exact Bound for the Convergence of Metropolis Chains 
    Self regenerative Markov chain Monte Carlo 
    Minorization conditions and convergence rates for Markov chain Monte Carlo 
    Gibbs-Markov models 
    Auxiliary variable methods for Markov chain Monte Carlo with applications 
    Markov Chain Monte Carlo Model Determination for Hierarchical and Graphical Models 
    Mediating instrumental variables 
    A qualitative framework for probabilistic inference 
    Adaptation for Self Regenerative MCMC 

Competitive environments evolve better solutions for complex tasks 

  RTM ($\psi_e$): 
    Coevolving High Level Representations 
    A Survey of Evolutionary Strategies 
    Genetic Algorithms in Search, Optimization and Machine Learning 
    Strongly typed genetic programming in evolving cooperation strategies 
    Solving combinatorial problems using evolutionary algorithms 
    A promising genetic algorithm approach to job-shop scheduling. . . 
    Evolutionary Module Acquisition 
    An Empirical Investigation of Multi-Parent Recombination Operators. . . 

  LDA + Regression: 
    A New Algorithm for DNA Sequence Assembly 
    Identification of protein coding regions in genomic DNA 
    Solving combinatorial problems using evolutionary algorithms 
    A promising genetic algorithm approach to job-shop scheduling. . . 
    A genetic algorithm for passive management 
    The Performance of a Genetic Algorithm on a Chaotic Objective Function 
    Adaptive global optimization with local search 
    Mutation rates as adaptations 

($\psi_e$) and LDA + Regression as predictive models. These suggestions were 
computed from a model fit on one of the folds of the Cora data using 10 top- 
ics. (Results are qualitatively similar for models fit using different numbers 
of topics; see Section 5 for strategies for choosing the number of topics.) 
The top results illustrate suggested links for "Markov chain Monte Carlo 
convergence diagnostics: A comparative review," which occurs in this fold's 
training set. The bottom results illustrate suggested links for "Competitive 
environments evolve better solutions for complex tasks," which is in the test 
set. 

RTM outperforms LDA + Regression in being able to identify more true 
connections. For the first document, RTM finds 3 of the connected docu- 
ments versus 1 for LDA + Regression. For the second document, RTM finds 
3 while LDA + Regression does not find any. This qualitative behavior is 
borne out quantitatively over the entire corpus. Considering the precision of 
the first 20 documents retrieved by the models, RTM improves precision over 
LDA + Regression by 80%. (Twenty is a reasonable number of documents 
for a user to examine.) 
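
The precision comparison above can be illustrated with a short sketch. The helper names and the numbers in the example are hypothetical and chosen only to show how a relative improvement of 80% would arise; they are not the values measured on Cora.

```python
def precision_at_k(ranked_candidates, true_links, k=20):
    """Fraction of the first k retrieved documents that are true links."""
    top = ranked_candidates[:k]
    hits = sum(1 for c in top if c in true_links)
    return hits / k


def relative_improvement(new, old):
    """Relative gain of `new` over `old`, e.g. 0.8 for an 80% improvement."""
    return (new - old) / old


# Toy ranking with k = 2: one of the first two retrieved documents is a true link.
p = precision_at_k(["a", "b", "c", "d"], {"a", "d"}, k=2)
# Hypothetical precisions that would correspond to an 80% improvement.
gain = relative_improvement(0.18, 0.10)
```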

While both models found several connections which were not observed 
in the data, those found by the RTM are qualitatively different. In the first 
document, both sets of suggested links are about Markov chain Monte Carlo. 
However, the RTM finds more documents relating specifically to convergence 
and stationary behavior of Monte Carlo methods. LDA + Regression finds 
connections to documents in the milieu of MCMC, but many are only indi- 
rectly related to the input document. The RTM is able to capture that the 
notion of "convergence" is an important predictor for citations, and has ad- 
justed the topic distribution and predictors correspondingly. For the second 
document, the documents found by the RTM are also of a different nature 
than those found by LDA + Regression. All of the documents suggested 
by RTM relate to genetic algorithms. LDA + Regression, however, suggests 
some documents which are about genomics. By relying only on words, LDA 
+ Regression conflates two "genetic" topics which are similar in vocabulary 
but different in citation structure. In contrast, the RTM partitions the latent 
space differently, recognizing that papers about DNA sequencing are unlikely 
to cite papers about genetic algorithms, and vice versa. By modeling the 
properties of the network jointly with the content of the documents, the 
model is better able to tease apart the community structure. 

4.3. Modeling spatial data. While explicitly linked structures like cita- 
tion networks offer one sort of connectivity, data with spatial or temporal 
information offer another sort of connectivity. In this section we show how 
RTMs can be used to model spatially connected data by applying them to the 
LocalNews data set, a corpus of news headlines and summaries from each 
state, with document linkage determined by spatial adjacency. 
Figure 5 shows the per-state topic distributions inferred by RTM (left) 
and LDA (right). Both models were fit with five topics using the same ini- 
tialization. (We restrict the discussion here to five topics for expositional 
convenience. See Section 5 for a discussion on potential approaches for select- 
ing the number of topics.) While topics are, strictly speaking, exchangeable 
and therefore not comparable between models, using the same initialization 
typically yields topics which are amenable to comparison. Each row of Fig- 
ure 5 shows a single component of each state's topic proportion for RTM 
and LDA. That is, if $\theta_s$ is the latent topic proportions vector for state $s$, 
then $\theta_{s1}$ governs the intensity of that state's color in the first row, $\theta_{s2}$ the 
second, and so on. 

While both RTM and LDA model the words in each state's local news 
corpus, LDA ignores geographical information. Hence, it finds topics which 
are distributed over a wide swath of states which are often not contiguous. 
For example, LDA's topic 1 is strongly expressed by Maine and Illinois, 
along with Texas and other states in the South and West. In contrast, RTM 
only assigns nontrivial mass to topic 1 in Southern states. Similarly, LDA 
finds that topic 5 is expressed by several states in the Northeast and the 
West. The RTM, however, concentrates topic 5's mass on the Northeastern 
states. 

The RTM does so by finding different topic assignments for each state and, 
commensurately, different distributions over words for each topic. Table 3 
shows the top words in each RTM topic and each LDA topic. Words are 
ranked by the following score: 



The score finds words which are likely to appear in a topic, but also cor- 
rects for frequent words. The score therefore puts greater weight on words 
which more easily characterize a topic. Table 3 shows that RTM finds words 
more geographically indicative. While LDA provides one way of analyzing 
this collection of documents, the RTM enables a different approach which 
is geographically cognizant. For example, LDA's topic 3 is an assortment 
of themes associated with California (e.g., "marriage") as well as others 
("scores," "registration," "schools"). The RTM, on the other hand, discovers 
words thematically related to a single news item ("measure," "protesters," 
"appeals" ) local to California. The RTM typically finds groups of words as- 
sociated with specific news stories, since they are easily localized, while LDA 
finds words which cut broadly across news stories in many states. Thus, on 
topic 5, the RTM discovers key words associated with news stories local 
to the Northeast such as "manslaughter" and "developer." On topic 5, the 
RTM also discovers a peculiarity of the Northeastern dialect: that roads are 
given the appellation "route" more frequently than elsewhere in the country. 
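
As an illustration of a ranking score with the two properties described above (it rewards words that are probable in a topic and discounts words that are probable in every topic), the sketch below uses $\beta_{k,w} \bigl(\log \beta_{k,w} - \frac{1}{K} \sum_{k'} \log \beta_{k',w}\bigr)$. Treating this as the score is an assumption made for the sake of the example; the function names and the toy topic matrix are hypothetical.

```python
import math


def term_scores(beta):
    """beta[k][w]: probability of word w under topic k (all entries > 0).
    Assumed score: beta_kw * (log beta_kw - mean over topics of log beta_k'w)."""
    num_topics = len(beta)
    vocab_size = len(beta[0])
    scores = []
    for k in range(num_topics):
        row = []
        for w in range(vocab_size):
            mean_log = sum(math.log(beta[j][w]) for j in range(num_topics)) / num_topics
            row.append(beta[k][w] * (math.log(beta[k][w]) - mean_log))
        scores.append(row)
    return scores


def top_words(beta, vocabulary, k, n=8):
    """Return the n highest-scoring words for topic k."""
    scores = term_scores(beta)[k]
    order = sorted(range(len(vocabulary)), key=lambda w: scores[w], reverse=True)
    return [vocabulary[w] for w in order[:n]]


# Toy example: "the" is equally probable in both topics, so its score is zero.
beta = [[0.5, 0.4, 0.1], [0.5, 0.1, 0.4]]
vocabulary = ["the", "snow", "crash"]
best_0 = top_words(beta, vocabulary, 0, n=1)
best_1 = top_words(beta, vocabulary, 1, n=1)
```

Although "the" is the most probable word in both toy topics, it is not ranked first in either, because it does not distinguish between them.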

Fig. 5. A comparison between RTM (left) and LDA (right) of topic distributions on local 
news data. Each color/row depicts a single topic. Each state's color intensity indicates the 
magnitude of that topic's component. The corresponding words associated with each topic 
are given in Table 3. Whereas LDA finds geographically diffuse topics, RTM, by modeling 
spatial connectivity, finds coherent regions. 

By combining textual information along with geographical information, 
the RTM provides a novel exploratory tool for identifying clusters of words 
that are driven by both word co-occurrence and geographic proximity. Note 
that the RTM finds regions in the United States which correspond to typical 

Table 3 

The top eight words in each RTM (left) and LDA (right) topic shown in Figure 5 ranked 
by score (defined below). RTM finds words which are predictive of both a state's 
geography and its local news 

           RTM             LDA 
Topic 1    comments        election 
           dead            plane 
           scores          landfill 
           landfill        dead 
           plane           police 
           metro           union 
           courthouse      interests 
           evidence        veterans 

Topic 2    crash           crash 
           yesterday       police 
           registration    yesterday 
           county          judge 
           police          fire 
           children        leave 
           quarter         charges 
           campaign        investors 

Topic 3    measure         comments 
           marriage        marriage 
           suspect         register 
           officer         scores 
           guards          schools 
           protesters      comment 
           appeals         registration 
           finger          rights 

Topic 4    bridge          snow 
           area            city 
           veterans        veterans 
           winter          votes 
           city            winter 
           snow            bridge 
           deer            recount 
           concert         lion 

Topic 5    manslaughter    garage 
           route           girls 
           girls           video 
           state           dealers 
           knife           underage 
           grounds         housing 
           committee       mall 
           developer       union 


clusterings of states: the South, the Northeast, the Midwest, etc. Further, 
the soft clusterings found by RTM confirm many of our cultural intuitions — 
while New York is definitively a Northeastern state, Virginia occupies a 
liminal space between the Mid-Atlantic and the South. 

5. Discussion. There are many avenues for future work on relational 
topic models. Applying the RTM to diverse types of "documents" such as 
protein-interaction networks or social networks, whose node attributes are 
governed by rich internal structure, is one direction. Even the text docu- 
ments which we have focused on in this paper have internal structure such 
as syntax [Boyd-Graber and Blei (2008)] which we are discarding in the 
bag-of-words model. Augmenting and specializing the RTM to these cases 
may yield better models for many application domains. 

As with any parametric mixed-membership model, the number of la- 
tent components in the RTM must be chosen using either prior knowledge 
or model-selection techniques such as cross-validation. Incorporating non- 
parametric Bayesian priors such as the Dirichlet process into the model 
would allow it to flexibly adapt the number of topics to the data [Ferguson 
(1973); Antoniak (1974); Kemp, Griffiths and Tenenbaum (2004); Teh et 
al. (2007)]. This, in turn, may give researchers new insights into the latent 
membership structure of networks. 

In sum, the RTM is a hierarchical model of networks and per-node at- 
tribute data. The RTM is used to analyze linked corpora such as citation 
networks, linked web pages, social networks with user profiles, and geograph- 
ically tagged news. We have demonstrated qualitatively and quantitatively 
that the RTM provides an effective and useful mechanism for analyzing and 
using such data. It significantly improves on previous models, integrating 
both node-specific information and link structure to give better predictions. 

APPENDIX A: DERIVATION OF COORDINATE ASCENT UPDATES 

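Inference under the variational method amounts to finding values of the 
variational parameters $\gamma, \Phi$ which optimize the evidence lower bound, $\mathcal{L}$, 
given in equation (3.2). To do so, we first expand the expectations in these 
terms: 

$$
\begin{aligned}
\mathcal{L} ={}& \sum_{(d_1,d_2)} \mathcal{L}_{d_1,d_2} + \sum_d \sum_n \phi_{d,n}^T \log \beta_{\cdot,w_{d,n}} \\
&+ \sum_d \sum_n \phi_{d,n}^T \bigl(\Psi(\gamma_d) - \mathbf{1}\Psi(\mathbf{1}^T \gamma_d)\bigr) \\
&+ \sum_d (\alpha - \mathbf{1})^T \bigl(\Psi(\gamma_d) - \mathbf{1}\Psi(\mathbf{1}^T \gamma_d)\bigr) \\
&- \sum_d \sum_n \phi_{d,n}^T \log \phi_{d,n} \\
&- \sum_d (\gamma_d - \mathbf{1})^T \bigl(\Psi(\gamma_d) - \mathbf{1}\Psi(\mathbf{1}^T \gamma_d)\bigr) \\
&+ \sum_d \mathbf{1}^T \log \Gamma(\gamma_d) - \log \Gamma(\mathbf{1}^T \gamma_d),
\end{aligned}
\tag{A.1}
$$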
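where $\mathcal{L}_{d_1,d_2}$ is defined as in equation (3.3). Since $\mathcal{L}_{d_1,d_2}$ is independent of 
$\gamma$, we can collect all of the terms associated with $\gamma_d$ into 

$$
\mathcal{L}_d = \Bigl(\alpha + \sum_n \phi_{d,n} - \gamma_d\Bigr)^T \bigl(\Psi(\gamma_d) - \mathbf{1}\Psi(\mathbf{1}^T \gamma_d)\bigr)
+ \mathbf{1}^T \log \Gamma(\gamma_d) - \log \Gamma(\mathbf{1}^T \gamma_d).
$$

Taking the derivatives and setting equal to zero leads to the following 
optimality condition: 

$$
\Bigl(\alpha + \sum_n \phi_{d,n} - \gamma_d\Bigr)^T \bigl(\Psi'(\gamma_d) - \mathbf{1}\Psi'(\mathbf{1}^T \gamma_d)\bigr) = 0,
$$

which is satisfied by the update 

$$
\gamma_d \leftarrow \alpha + \sum_n \phi_{d,n}. \tag{A.2}
$$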



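In order to derive the update for $\phi_{d,n}$, we also collect its associated terms, 

$$
\mathcal{L}_{d,n} = \phi_{d,n}^T \bigl(-\log \phi_{d,n} + \log \beta_{\cdot,w_{d,n}} + \Psi(\gamma_d) - \mathbf{1}\Psi(\mathbf{1}^T \gamma_d)\bigr) + \sum_{d' \neq d} \mathcal{L}_{d,d'}.
$$

Adding a Lagrange multiplier to ensure that $\phi_{d,n}$ normalizes and setting the 
derivative equal to zero leads to the following condition: 

$$
\phi_{d,n} \propto \exp\Bigl\{\log \beta_{\cdot,w_{d,n}} + \Psi(\gamma_d) - \mathbf{1}\Psi(\mathbf{1}^T \gamma_d) + \sum_{d' \neq d} \nabla_{\phi_{d,n}} \mathcal{L}_{d,d'}\Bigr\}. \tag{A.3}
$$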
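
The two updates can be alternated for a single document as in the following sketch. It is an illustration under simplifying assumptions, not the implementation referenced in Section 4: the link-gradient term of (A.3) is passed in as a fixed vector `link_grad` rather than recomputed from the neighboring documents, the digamma function is approximated by a recurrence plus an asymptotic series, and all names are hypothetical.

```python
import math


def digamma(x):
    """Digamma function for x > 0 via recurrence and an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series


def update_document(word_ids, log_beta, alpha, link_grad, n_iter=50):
    """Alternate the phi update (A.3) and the gamma update (A.2) for one document.
    log_beta[k][w]: log probability of word w under topic k.
    link_grad[k]: summed gradient of the link terms, held fixed (a simplification)."""
    K = len(alpha)
    N = len(word_ids)
    gamma = [alpha[k] + N / K for k in range(K)]
    phi = []
    for _ in range(n_iter):
        # The term -1 * Psi(1^T gamma) is constant across topics and cancels
        # when phi is normalized, so only Psi(gamma_k) is needed.
        psi = [digamma(g) for g in gamma]
        phi = []
        for w in word_ids:
            logits = [log_beta[k][w] + psi[k] + link_grad[k] for k in range(K)]
            top = max(logits)
            weights = [math.exp(v - top) for v in logits]
            total = sum(weights)
            phi.append([v / total for v in weights])
        gamma = [alpha[k] + sum(p[k] for p in phi) for k in range(K)]
    return gamma, phi


log_beta = [[math.log(0.9), math.log(0.1)], [math.log(0.1), math.log(0.9)]]
gamma, phi = update_document([0, 0, 1], log_beta, [0.5, 0.5], [0.0, 0.0])
```

Because each $\phi_{d,n}$ sums to one, the entries of $\gamma_d$ always sum to $\mathbf{1}^T\alpha + N_d$, which gives a quick sanity check on an implementation.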

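The exact form of $\nabla_{\phi_{d,n}} \mathcal{L}_{d,d'}$ will depend on the link probability function 
chosen. If the expected log link probability depends only on $\bar\pi_{d_1,d_2} = \bar\phi_{d_1} \circ \bar\phi_{d_2}$, 
the gradients are given by equation (3.6). When $\psi_N$ is chosen as the 
link probability function, we expand the expectation, 

$$
\begin{aligned}
E_q[\log \psi_N(\bar z_d, \bar z_{d'})] &= -\eta^T E_q[(\bar z_d - \bar z_{d'}) \circ (\bar z_d - \bar z_{d'})] - \nu \\
&= -\nu - \sum_i \eta_i \bigl(E_q[\bar z_{d,i}^2] + E_q[\bar z_{d',i}^2] - 2 \bar\phi_{d,i} \bar\phi_{d',i}\bigr).
\end{aligned}
\tag{A.4}
$$

Because each word is independent under the variational distribution, 
$E_q[\bar z_{d,i}^2] = \mathrm{Var}(\bar z_{d,i}) + \bar\phi_{d,i}^2$, where 
$\mathrm{Var}(\bar z_{d,i}) = \frac{1}{N_d^2} \sum_n \phi_{d,n,i} (1 - \phi_{d,n,i})$. The gradient of 
this expression is given by equation (3.7). 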
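
The moments of $\bar z_d$ used above can be computed directly from the per-word variational parameters, as in this minimal sketch. It assumes `phi[n][i]` holds $\phi_{d,n,i}$ for one document; the function name is hypothetical.

```python
def zbar_moments(phi):
    """Given phi[n][i] for one document, return per-topic
    (mean, variance, second moment) of the averaged assignments z-bar."""
    N = len(phi)
    K = len(phi[0])
    mean = [sum(p[i] for p in phi) / N for i in range(K)]
    # Words are independent under q, so variances of the indicators add.
    var = [sum(p[i] * (1.0 - p[i]) for p in phi) / N ** 2 for i in range(K)]
    second = [var[i] + mean[i] ** 2 for i in range(K)]
    return mean, var, second


mean, var, second = zbar_moments([[1.0, 0.0], [0.5, 0.5]])
```

The variance shrinks at rate $1/N_d$, which is consistent with treating it as small for long documents.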

APPENDIX B: DERIVATION OF PARAMETER ESTIMATES 

In order to estimate the parameters of our model, we find values of the 
topic multinomial parameters $\beta$ and link probability parameters $\eta, \nu$ which 
maximize the variational objective, $\mathcal{L}$, given in equation (3.2). 

To optimize $\beta$, it suffices to take the derivative of the expanded objec- 
tive given in equation (A.1) along with a Lagrange multiplier to enforce 
normalization: 

$$
\partial_{\beta_{k,w}} \mathcal{L} = \sum_d \sum_n \phi_{d,n,k} \, 1(w = w_{d,n}) \frac{1}{\beta_{k,w}} - \lambda_k.
$$

Setting this quantity equal to zero and solving yields the update given in 
equation (3.8). 
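
Setting the derivative above to zero gives $\beta_{k,w} \propto \sum_d \sum_n \phi_{d,n,k} \, 1(w = w_{d,n})$, normalized over the vocabulary. The sketch below implements that normalization directly; the data layout and function name are assumptions made for illustration.

```python
def update_beta(docs, phis, num_topics, vocab_size):
    """docs[d]: list of word ids; phis[d][n][k]: variational parameter phi_{d,n,k}.
    Returns beta[k][w], the normalized expected word counts per topic."""
    counts = [[0.0] * vocab_size for _ in range(num_topics)]
    for word_ids, phi in zip(docs, phis):
        for w, phi_n in zip(word_ids, phi):
            for k in range(num_topics):
                counts[k][w] += phi_n[k]
    beta = []
    for row in counts:
        total = sum(row)  # assumed positive: every topic receives some mass
        beta.append([c / total for c in row])
    return beta


docs = [[0, 1], [1]]
phis = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5]]]
beta = update_beta(docs, phis, num_topics=2, vocab_size=2)
```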

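By taking the gradient of equation (A.1) with respect to $\eta$ and $\nu$, we 
can also derive updates for the link probability parameters. When the ex- 
pectation of the logarithm of the link probability function depends only 
on $\eta^T \bar\pi_{d,d'} + \nu$, as with all the link functions given in equation (3.4), then 
these derivatives take a convenient form. For notational expedience, denote 
$\eta^+ = \langle \eta, \nu \rangle$ and $\bar\pi^+_{d,d'} = \langle \bar\pi_{d,d'}, 1 \rangle$. Then the derivatives can be written as 

$$
\begin{aligned}
\nabla_{\eta^+} \mathcal{L}^{\sigma}_{d,d'} &\approx \bigl(1 - \sigma(\eta^{+T} \bar\pi^+_{d,d'})\bigr) \bar\pi^+_{d,d'}, \\
\nabla_{\eta^+} \mathcal{L}^{\Phi}_{d,d'} &\approx \frac{\Phi'(\eta^{+T} \bar\pi^+_{d,d'})}{\Phi(\eta^{+T} \bar\pi^+_{d,d'})} \bar\pi^+_{d,d'}, \\
\nabla_{\eta^+} \mathcal{L}^{e}_{d,d'} &= \bar\pi^+_{d,d'}.
\end{aligned}
\tag{B.1}
$$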

Note that all of these gradients are positive because we are faced with a 
one-class estimation problem. Unchecked, the parameter estimates will di- 
verge. While a variety of techniques exist to address this problem, one set 
of strategies is to add regularization. 

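A common regularization for regression problems is the $\ell_2$ regularizer. 
This penalizes the objective $\mathcal{L}$ with the term $\lambda \|\eta\|^2$, where $\lambda$ is a free 
parameter. This penalization has a Bayesian interpretation as a Gaussian 
prior on $\eta$. 

In lieu of or in conjunction with $\ell_2$ regularization, one can also employ 
regularization which in effect injects some number of observations, $\rho$, for 
which the link variable $y = 0$. We associate with these observations a doc- 
ument similarity of $\bar\pi_\alpha = \frac{\alpha}{\mathbf{1}^T \alpha} \circ \frac{\alpha}{\mathbf{1}^T \alpha}$, the expected Hadamard product of 
any two documents given the Dirichlet prior of the model. Because both $\psi_\sigma$ 
and $\psi_\Phi$ are symmetric, the gradients of these regularization terms can be 
written as 

$$
\begin{aligned}
\nabla_{\eta^+} \mathcal{R}_\sigma &= -\rho \, \sigma(\eta^{+T} \bar\pi^+_\alpha) \, \bar\pi^+_\alpha, \\
\nabla_{\eta^+} \mathcal{R}_\Phi &= -\rho \, \frac{\Phi'(-\eta^{+T} \bar\pi^+_\alpha)}{\Phi(-\eta^{+T} \bar\pi^+_\alpha)} \, \bar\pi^+_\alpha.
\end{aligned}
$$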
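
For the logistic link function, the pseudo-observation penalty can be sketched as follows. The code computes the prior document similarity $\bar\pi_\alpha$ and the gradient $-\rho\,\sigma(\eta^{+T}\bar\pi^+_\alpha)\,\bar\pi^+_\alpha$; it is an illustration only, the function names are hypothetical, and the parameter values in the example are arbitrary.

```python
import math


def prior_similarity(alpha):
    """Expected Hadamard product of two documents' topic proportions
    under the Dirichlet prior: (alpha / sum(alpha)) squared, element-wise."""
    total = sum(alpha)
    return [(a / total) ** 2 for a in alpha]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_penalty_gradient(eta, nu, alpha, rho):
    """Gradient with respect to (eta, nu) of the penalty contributed by
    rho pseudo-observations with link variable y = 0."""
    pi_plus = prior_similarity(alpha) + [1.0]  # append 1 for the intercept nu
    eta_plus = list(eta) + [nu]
    activation = sum(e * p for e, p in zip(eta_plus, pi_plus))
    return [-rho * sigmoid(activation) * p for p in pi_plus]


grad = sigmoid_penalty_gradient(eta=[0.0, 0.0], nu=0.0, alpha=[1.0, 1.0], rho=4.0)
```

Every component of this gradient is negative, so it counteracts the positive gradients of the observed links and keeps the estimates from diverging.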

While this approach could also be applied to $\psi_e$, here we use a different 
approximation. We do this for two reasons. First, we cannot optimize the 
parameters of $\psi_e$ in an unconstrained fashion since this may lead to link 
functions which are not probabilities. Second, the approximation we propose 
will lead to explicit updates. 

Because $E_q[\log \psi_e(\bar z_d \circ \bar z_{d'})]$ is linear in $\bar\pi_{d,d'}$ by equation (3.4), this sug- 
gests a linear approximation of $E_q[\log(1 - \psi_e(\bar z_d \circ \bar z_{d'}))]$. Namely, we let 

$$
E_q[\log(1 - \psi_e(\bar z_d \circ \bar z_{d'}))] \approx \eta'^T \bar\pi_{d,d'} + \nu'.
$$

This leads to a penalty term of the form 

$$
\mathcal{R}_e = \rho \, (\eta'^T \bar\pi_\alpha + \nu').
$$

We fit the parameters of the approximation, $\eta', \nu'$, by making the approxi- 
mation exact whenever $\bar\pi_{d,d'} = 0$ or $\max \bar\pi_{d,d'} = 1$. This yields the following 
$K + 1$ equations for the $K + 1$ parameters of the approximation: 

$$
\begin{aligned}
\nu' &= \log(1 - \exp(\nu)), \\
\eta'_i &= \log(1 - \exp(\eta_i + \nu)) - \nu'.
\end{aligned}
$$

Combining the gradient of the likelihood of the observations given in equa- 
tion (B.1) with the gradient of the penalty $\mathcal{R}_e$ and solving leads to the 
following updates: 

$$
\begin{aligned}
\nu &\leftarrow \log(M - \mathbf{1}^T \bar\Pi) - \log\bigl(\rho (1 - \mathbf{1}^T \bar\pi_\alpha) + M - \mathbf{1}^T \bar\Pi\bigr), \\
\eta &\leftarrow \log(\bar\Pi) - \log(\bar\Pi + \rho \bar\pi_\alpha) - \mathbf{1}\nu,
\end{aligned}
$$

where $M = \sum_{(d_1,d_2)} 1$ and $\bar\Pi = \sum_{(d_1,d_2)} \bar\pi_{d_1,d_2}$. Note that because of the 
constraints on our approximation, these updates are guaranteed to yield 
parameters for which $0 \le \psi_e \le 1$. 
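
The closed-form updates for $\psi_e$ can be sketched as below. The inputs are assumed to be precomputed: `pi_sum` is $\bar\Pi$, `num_links` is $M$, `pi_alpha` is $\bar\pi_\alpha$ and `rho` is $\rho$. The function name and the example values are hypothetical.

```python
import math


def fit_exponential_link(pi_sum, num_links, pi_alpha, rho):
    """Explicit updates for the exponential link function parameters.
    Assumes every entry of pi_sum is positive and sum(pi_sum) < num_links."""
    slack = num_links - sum(pi_sum)
    nu = math.log(slack) - math.log(rho * (1.0 - sum(pi_alpha)) + slack)
    eta = [
        math.log(p) - math.log(p + rho * a) - nu
        for p, a in zip(pi_sum, pi_alpha)
    ]
    return eta, nu


eta, nu = fit_exponential_link(
    pi_sum=[2.0, 1.0], num_links=10, pi_alpha=[0.25, 0.25], rho=5.0
)
```

With these updates both $\nu$ and each $\eta_i + \nu$ are nonpositive, so $\exp(\eta^T \bar\pi + \nu)$ stays at or below one at the extreme points used to fit the approximation.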

Finally, in order to fit parameters for $\psi_N$, we begin by assuming the vari- 
ance terms of equation (A.4) are small. Equation (A.4) can then be written 
as 

$$
E_q[\log \psi_N(\bar z_d, \bar z_{d'})] = -\nu - \eta^T (\bar\phi_d - \bar\phi_{d'}) \circ (\bar\phi_d - \bar\phi_{d'}),
$$

which is the log likelihood of a Gaussian distribution where $\bar\phi_d - \bar\phi_{d'}$ is 
random with mean 0 and diagonal variance $\frac{1}{2\eta}$. This suggests fitting $\eta$ using 
the empirically observed variance: 

$$
\eta \leftarrow \frac{M}{2 \sum_{d,d'} (\bar\phi_d - \bar\phi_{d'}) \circ (\bar\phi_d - \bar\phi_{d'})}.
$$

$\nu$ acts as a scaling factor for the Gaussian distribution; here we want only 
to ensure that the total probability mass respects the frequency of observed 
links to regularization "observations." Equating the normalization constant 
of the distribution with the desired probability mass yields the update 

$$
\nu \leftarrow \log \tfrac{1}{2} \pi^{K/2} + \log(\rho + M) - \log M - \tfrac{1}{2} \mathbf{1}^T \log \eta,
$$

guarding against values of $\nu$ which would make $\psi_N$ inadmissible as a prob- 
ability. 
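
The variance-matching estimate of $\eta$ can be sketched as follows; the update for $\nu$ is omitted here. The input layout (one pair of $\bar\phi$ vectors per observed link) and the function name are assumptions made for illustration.

```python
def fit_gaussian_eta(phi_bar_pairs):
    """phi_bar_pairs: list of (phi_bar_d, phi_bar_d_prime), one per observed link.
    Returns eta_i = M / (2 * sum of squared differences in topic i).
    Assumes the squared differences are nonzero in every topic."""
    M = len(phi_bar_pairs)
    K = len(phi_bar_pairs[0][0])
    squared = [0.0] * K
    for a, b in phi_bar_pairs:
        for i in range(K):
            squared[i] += (a[i] - b[i]) ** 2
    return [M / (2.0 * s) for s in squared]


pairs = [([0.75, 0.25], [0.25, 0.75]), ([0.5, 0.5], [0.0, 1.0])]
eta = fit_gaussian_eta(pairs)
```

Topics on which linked documents differ little receive a large $\eta_i$, so disagreement on those topics is penalized most strongly.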